feat(backend): Q5_K packed matmul + ARM NEON + Kotlin/Native cinterop path#734

Merged
michalharakal merged 4 commits into develop from feature/q5k-neon-kernels
Jun 11, 2026

@michalharakal

Contributor

Adds a first-class Q5_K packed in-kernel dequant-matmul to the CPU backend (it was previously only eagerly decoded to FP32), hand-written ARM NEON kernels, and a Kotlin/Native cinterop consumption path so the kernels run on the board binary (not just the JVM via FFM).

What's here

  • Q5_K (256-elt / 176-byte super-block): TensorEncoding.Q5_K, Q5_KTensorData/Q5_KBlockTensorData (5th-bit fold from qh), Q5KMatmulKernel SPI, scalar (commonMain) + Panama (JVM) + native-C kernels, DefaultCpuOps dispatch + lazy transpose, and a StreamingGgufParametersLoader Q5_K/Q6_K packed branch.
  • ARM NEON (behind #if __ARM_NEON; x86 keeps the scalar fallback): fp32, q8_0, q4k, q5k. CMake aarch64 branch -march=armv8.2-a+fp16+dotprod (no +i8mm — A55 lacks it). Cross toolchain + opt-in -PcrossArm64.
  • Kotlin/Native cinterop: CMake now also builds a static archive libskainet_kernels.a; linuxX64 + linuxArm64 targets with a shared nativeMain; NativeKn*MatmulKernel + NativeKnKernelProvider (+ installNativeKernels()) so K/N resolves the C kernels through KernelRegistry.

Verification

  • Q5_K bit-exact vs the DequantOps golden across blocks; native↔Panama↔scalar matmul parity; capability-matrix gate updated.
  • Kotlin/Native: cinterop kernel ↔ scalar parity + registry resolution green on linuxX64 (6 tests); compileKotlinLinuxArm64 + cinterop cross-compile from x86.
  • JVM/FFM path unchanged.
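
The matmul parity checks listed above compare the output of one kernel against a reference output. A minimal sketch of such a comparison in C is shown here; the helper name `outputs_match` and the use of an absolute tolerance are assumptions for illustration, since the PR does not state how its parity tests compare results or what threshold they use.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the PR): returns 1 when every element of
   `got` is within `tol` of the matching element of `ref`, else 0. */
int outputs_match(const float *ref, const float *got, size_t n, float tol)
{
    assert(tol >= 0.0f);
    for (size_t i = 0; i < n; ++i) {
        float d = ref[i] - got[i];
        if (d < 0.0f) {
            d = -d;                 /* absolute difference without libm */
        }
        if (!(d <= tol)) {
            return 0;               /* written this way so NaN also fails */
        }
    }
    return 1;
}
```

Passing `tol = 0.0f` turns this into an exact comparison, which is the stricter kind of check the bit-exact dequantization test implies; a small non-zero tolerance is the usual choice when two kernels sum in a different order.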

Board-verify-pending

The NEON paths are aarch64-syntax-validated (clang --target=aarch64) but not executed (x86 host, no QEMU). The final aarch64 binary link + NEON runtime parity need the SL2610 (or QEMU).

🤖 Generated with Claude Code

michalharakal and others added 4 commits June 10, 2026 23:13
Adds Q5_K as a packed in-kernel dequant-matmul format (previously Q5_K was
only eagerly decoded to FP32 on load), mirroring the existing Q4_K plumbing,
and hand-written ARM NEON paths for the native CPU kernels.

Q5_K (256-elt / 176-byte super-block: d, dMin, 12 packed scales, 32-byte qh
high-bit plane, 128-byte qs low nibbles; 5-bit code = lowNibble | (5th<<4)):
- TensorEncoding.Q5_K; Q5_KTensorData / Q5_KBlockTensorData (5th-bit fold).
- Q5KMatmulKernel SPI + matmulQ5K()/"Q5_K" in KernelProvider.supports().
- ScalarQ5_KMatmulKernel (commonMain/KN), PanamaVectorQ5_KMatmulKernel (JVM),
  native C skainet_q5k_matmul + NativeQ5KMatmulKernel (FFM); all registered.
- DefaultCpuOps matmul dispatch + lazy-transpose branches.
- StreamingGgufParametersLoader: Q5_K + Q6_K packed branches (a Q5_K_M GGUF
  now loads end-to-end instead of SKIP'ing most tensors).
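
The super-block layout described above can be sketched as a C struct, with a helper that rebuilds one 5-bit code from its low nibble and its qh bit. The field sizes follow from the description (2 + 2 + 12 + 32 + 128 = 176 bytes, so d and dMin are 16-bit values). The type and function names are hypothetical, and the mapping from element index to nibble and qh bit follows the common GGUF Q5_K convention, which is an assumption here; the PR's own classes may index differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the 176-byte Q5_K super-block (256 elements).
   d and d_min are kept as raw 16-bit values; fp16 decoding is out of scope. */
typedef struct {
    uint16_t d;           /* super-block scale (fp16 bits)          */
    uint16_t d_min;       /* super-block min scale (fp16 bits)      */
    uint8_t  scales[12];  /* packed 6-bit sub-block scales and mins */
    uint8_t  qh[32];      /* high-bit plane: one 5th bit per code   */
    uint8_t  qs[128];     /* low nibbles, two codes per byte        */
} q5k_block_sketch;

static_assert(sizeof(q5k_block_sketch) == 176, "Q5_K super-block is 176 bytes");

/* Returns the 5-bit code (0..31) of element idx (0..255).
   Assumed ordering: each group of 64 elements uses 32 qs bytes; the first
   32 elements take the low nibbles, the next 32 take the high nibbles. */
uint8_t q5k_code(const q5k_block_sketch *b, int idx)
{
    assert(idx >= 0 && idx < 256);
    int group = idx / 64;          /* 0..3 */
    int half  = (idx % 64) / 32;   /* 0 = low nibble, 1 = high nibble */
    int lane  = idx % 32;          /* byte within the group, also the qh index */
    uint8_t byte = b->qs[group * 32 + lane];
    uint8_t low  = half ? (uint8_t)(byte >> 4) : (uint8_t)(byte & 0x0F);
    uint8_t high = (uint8_t)((b->qh[lane] >> (2 * group + half)) & 1u);
    return (uint8_t)(low | (high << 4));   /* code = lowNibble | (5th << 4) */
}
```

Each qh byte therefore carries the 5th bit for eight different elements, one per (group, half) pair, which is why the bit position is only known at run time inside the kernel loop.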

Tests: Q5_KBlockTensorData bit-exact vs DequantOps golden across blocks;
native<->Panama<->scalar matmul parity; KernelSupportMatrixTest gate updated.

ARM NEON (behind #if __ARM_NEON in skainet_simd.h; x86 keeps the scalar
fallback, re-verified green):
- fp32 (broadcast+vfmaq), q8_0 (widen int8->f32+vfmaq), q4k/q5k (nibble
  unpack + dual code/input accumulators; q5k folds the qh 5th bit via a
  runtime-count vshlq_u8).
- CMake aarch64 branch: -march=armv8.2-a+fp16+dotprod (no +i8mm — A55 lacks
  it). Cross toolchain-aarch64.cmake + opt-in -PcrossArm64 gradle tasks;
  default x86 build unaffected.
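
The qh fold with a runtime-count vshlq_u8 can be sketched as follows. vshlq_u8 takes a signed per-lane shift count and shifts right when that count is negative, so the bit position does not have to be a compile-time immediate. The function name `q5k_fold16` and its parameters are hypothetical; this is an illustration of the technique with the same `__ARM_NEON` guard and scalar fallback, not the PR's kernel.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: combine 16 nibbles from qs with bit `bit` of the
   matching qh bytes into 16 five-bit codes. `high_nibble` selects which
   half of each qs byte is used. */
void q5k_fold16(const uint8_t *qs, const uint8_t *qh, int bit,
                int high_nibble, uint8_t *out)
{
    assert(bit >= 0 && bit < 8);
#if defined(__ARM_NEON)
    uint8x16_t q  = vld1q_u8(qs);
    uint8x16_t h  = vld1q_u8(qh);
    uint8x16_t lo = high_nibble ? vshrq_n_u8(q, 4)
                                : vandq_u8(q, vdupq_n_u8(0x0F));
    /* Negative count = logical right shift by `bit`, chosen at run time. */
    uint8x16_t hb = vshlq_u8(h, vdupq_n_s8((int8_t)-bit));
    hb = vandq_u8(hb, vdupq_n_u8(1));           /* isolate the 5th bit   */
    vst1q_u8(out, vorrq_u8(lo, vshlq_n_u8(hb, 4)));
#else
    /* Scalar fallback, used on x86 and any target without NEON. */
    for (int i = 0; i < 16; ++i) {
        uint8_t lo = high_nibble ? (uint8_t)(qs[i] >> 4)
                                 : (uint8_t)(qs[i] & 0x0F);
        uint8_t hb = (uint8_t)((qh[i] >> bit) & 1u);
        out[i] = (uint8_t)(lo | (hb << 4));
    }
#endif
}
```

On an x86 host only the scalar branch is compiled and run, which matches the status noted below: the NEON branch can be syntax-checked by a cross compiler but needs an aarch64 target or emulator to execute.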

BOARD-VERIFY-PENDING: the NEON paths are aarch64-syntax-validated (clang
--target=aarch64) but not executed (x86 host, no QEMU). Run the parity tests
under qemu-aarch64 or on the SL2610 before relying on them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The hand-written matmul kernels were JVM-only (consumed via FFM), but the
SL2610 board binary is Kotlin/Native — it can't use the FFM wrapper. Add a
K/N consumption path via cinterop so the board gets the same C (and, on
aarch64, NEON) kernels.

- CMake builds a STATIC archive (skainet_kernels_static -> libskainet_kernels.a)
  alongside the SHARED lib; same sources + flags (incl. the aarch64 NEON march).
- cinterop .def (skainet_kernels.h -> sk.ainet.kernels.cinterop).
- linuxX64 target on the (previously jvm-only) module, linking the static
  archive into K/N binaries; link tasks depend on the CMake build.
- NativeKnQ5KMatmulKernel (linuxX64Main): calls skainet_q5k_matmul via cinterop
  with pinned arrays (zero-copy).

POC verified on the host (linuxX64): NativeKnQ5KMatmulKernelParityTest — the
cinterop kernel matches the commonMain ScalarQ5_KMatmulKernel across 4 shapes
(tests=4, failures=0). JVM/FFM path unchanged (jvmTest green). linuxArm64 board
target + NEON runtime check are the remaining step.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The K/N analogue of the JVM NativeKernelProvider (FFM): a KernelProvider
(priority 100) exposing the cinterop-backed Q5_K/Q4_K/Q8_0/Q4_0 matmul kernels,
plus installNativeKernels() to register it in KernelRegistry — the path the
eager runtime's DefaultCpuOps.chooseQuantizedMatmulHeap uses to resolve a
kernel. K/N has no ServiceLoader, so registration is an explicit call by the
consumer (scalar fallback for Q6_K etc. is registered separately from
skainet-backend-cpu).
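
The resolution rule described above (explicit registration, then the highest-priority provider that supports the format wins) can be illustrated with a small language-neutral sketch in C. The real KernelRegistry is Kotlin and its API is not shown in this PR, so every name below is hypothetical; only the priority value 100 and the format names come from the text, and the scalar provider's priority of 0 in the usage is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical provider record: a name, a priority, and a
   NULL-terminated list of supported format names. */
typedef struct {
    const char *name;
    int priority;
    const char *const *formats;
} provider_sketch;

#define MAX_PROVIDERS 8
static const provider_sketch *g_providers[MAX_PROVIDERS];
static size_t g_provider_count = 0;

/* Explicit registration: the consumer calls this itself because there is
   no service-discovery mechanism to do it automatically. */
void registry_install(const provider_sketch *p)
{
    assert(p != NULL && g_provider_count < MAX_PROVIDERS);
    g_providers[g_provider_count++] = p;
}

static int provider_supports(const provider_sketch *p, const char *format)
{
    for (const char *const *f = p->formats; *f != NULL; ++f) {
        if (strcmp(*f, format) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Returns the highest-priority installed provider that supports `format`,
   or NULL when no installed provider does. */
const provider_sketch *registry_resolve(const char *format)
{
    const provider_sketch *best = NULL;
    for (size_t i = 0; i < g_provider_count; ++i) {
        const provider_sketch *p = g_providers[i];
        if (provider_supports(p, format) &&
            (best == NULL || p->priority > best->priority)) {
            best = p;
        }
    }
    return best;
}
```

With a scalar provider at priority 0 covering Q5_K and Q6_K and a native provider at priority 100 covering Q5_K, Q4_K, Q8_0 and Q4_0, a Q5_K lookup resolves to the native provider while Q6_K still falls back to the scalar one.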

Verified on linuxX64: NativeKnKernelProviderTest — installNativeKernels makes
native-cinterop the best-available provider, its Q5_K kernel is the
registry-resolved kernel, and it matches the scalar reference (6 K/N tests
green total).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Promote the K/N cinterop path from the linuxX64 POC to the real board target:
- linuxArm64 target with the same skainet_kernels cinterop; links the aarch64
  cross-built static archive (cmake-build-arm64/libskainet_kernels.a, NEON).
- Shared `nativeMain` source set holds NativeKn*MatmulKernel + the provider, so
  linuxX64 and linuxArm64 share one implementation (cinterop bindings are
  commonized across both targets).
- linuxArm64 link tasks depend on the aarch64 cross-build only under -PcrossArm64
  (toolchain present); a plain host build still compiles linuxArm64 to a klib.

Verified on host: compileKotlinLinuxArm64 + cinteropSkainetKernelsLinuxArm64
succeed (cross-compiled from x86); linuxX64Test still green (6 tests) on the
shared nativeMain. Final aarch64 binary link + NEON runtime are board-verify-pending.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

📖 Documentation Preview

The documentation has been built successfully for this PR.

Generated Files:

  • Operator documentation: docs/modules/operators/_generated_/
  • JSON schema output: operators.json

Artifacts:

  • Download the documentation-preview-734 artifact to view the complete documentation locally.

This comment will be updated automatically when the PR is updated.

@michalharakal michalharakal merged commit 92485f2 into develop Jun 11, 2026
14 checks passed
@michalharakal michalharakal deleted the feature/q5k-neon-kernels branch June 11, 2026 12:51
@michalharakal michalharakal mentioned this pull request Jun 13, 2026